@xylar
Collaborator

@xylar xylar commented Oct 31, 2025

This is not a configuration supported by E3SM so we need to provide the appropriate modules.

Checklist

  • Testing comment, if appropriate, in the PR documents testing used to verify the changes

@xylar xylar self-assigned this Oct 31, 2025
@xylar xylar added the bug Something isn't working label Oct 31, 2025
@xylar
Collaborator Author

xylar commented Oct 31, 2025

Testing

I deployed E3SM-Unified 1.12.0rc3 with this fix in my scratch space on Compy. I then ran:

#!/bin/bash -l
#SBATCH --nodes=1
#SBATCH --time=2:00:00
#SBATCH -A e3sm
#SBATCH -p slurm
#SBATCH --qos regular
#SBATCH --job-name=mpi4py
#SBATCH --output=mpi4py.o%j
#SBATCH --error=mpi4py.e%j

set -e


source /compyfs/asay932/test_e3sm_unified/test_e3sm_unified_1.12.0rc3_compy.sh

python -c "from mpi4py import MPI"
python -c "from ILAMB.ModelResult import ModelResult"

I saw some warnings (repeated twice) but no errors:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              n0416
  Local adapter:           hfi1_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   n0416
  Local device: hfi1_0
--------------------------------------------------------------------------
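The check above treats anything printed to stderr as non-fatal as long as the interpreter exits cleanly. A minimal sketch of the same idea in Python follows; the helper name `smoke_import` is hypothetical and not part of E3SM-Unified or zppy. It runs each statement in a fresh interpreter and judges success by exit code only, so Open MPI warnings like the ones shown are captured but do not count as failures.

```python
import subprocess
import sys


def smoke_import(statement, python=sys.executable):
    """Run one Python statement in a fresh interpreter.

    Returns (ok, stderr). `ok` reflects the exit code only, so
    warnings written to stderr do not count as failures.
    `smoke_import` is a hypothetical helper name for illustration.
    """
    result = subprocess.run(
        [python, "-c", statement],
        capture_output=True,  # keep stdout/stderr for inspection
        text=True,            # decode output as str rather than bytes
    )
    return result.returncode == 0, result.stderr
```

Calling `smoke_import("from mpi4py import MPI")` and `smoke_import("from ILAMB.ModelResult import ModelResult")` inside the activated environment would mirror the two checks in the batch script, with the warning text available in the second return value.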

@xylar
Collaborator Author

xylar commented Oct 31, 2025

We will need to find out from zppy testing if this warning is a problem or not.

@forsyth2
Contributor

forsyth2 commented Nov 3, 2025

@xylar @andrewdnolan I'm seeing the same exact shared library error on Perlmutter. I imagine it needs the same fix.

@xylar
Collaborator Author

xylar commented Nov 3, 2025

@forsyth2, oh, that's too bad. No, I wouldn't think it's the same fix.

@forsyth2
Contributor

forsyth2 commented Nov 3, 2025

Oh, hmm. Does that indicate it's more likely something on zppy's side then? Or still on the Unified environment side?

@xylar
Collaborator Author

xylar commented Nov 3, 2025

I'm running a test, but a bunch of modules were missing on the Compy side, which clearly explains the problem there. There's nothing similar on the Perlmutter side, so I'm baffled.

@xylar
Collaborator Author

xylar commented Nov 3, 2025

It could be that some modules are wrong or something. No idea...

@andrewdnolan
Collaborator

Testing on Perlmutter:

salloc --nodes 1 --qos interactive --time 00:15:00 --constraint cpu --account e3sm 

source /global/common/software/e3sm/anaconda_envs/test_e3sm_unified_1.12.0rc3_pm-cpu.sh

python -c "import mpi4py, xarray; print('mpi4py:', mpi4py.__version__)"
#mpi4py: 4.1.1 

srun -n 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())"
# Traceback (most recent call last):
# Traceback (most recent call last):
#   File "<string>", line 1, in <module>
#     from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())
#     ^^^^^^^^^^^^^^^^^^^^^^
# ImportError: libmpi.so.12: cannot open shared object file: No such file or directory
#   File "<string>", line 1, in <module>
#     from mpi4py import MPI; print(MPI.COMM_WORLD.Get_size())
#     ^^^^^^^^^^^^^^^^^^^^^^

Looks like the issue only occurs when launching through srun.
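One way to narrow this down might be to compare what the dynamic loader sees with and without srun. The sketch below is only a diagnostic idea, not something from this PR: the function name `probe_shared_library` is hypothetical, and it assumes the failure comes from the loader search path (for example `LD_LIBRARY_PATH`) differing between the login shell and the srun-launched process, which has not been confirmed here.

```python
import ctypes
import json
import os


def probe_shared_library(name="libmpi.so.12"):
    """Try to load a shared library and report what this process sees.

    `probe_shared_library` is a hypothetical helper for illustration.
    """
    search_path = os.environ.get("LD_LIBRARY_PATH", "")
    report = {
        "library": name,
        "loadable": False,
        "error": None,
        # Directories the dynamic loader inherits in this process.
        "LD_LIBRARY_PATH": [p for p in search_path.split(os.pathsep) if p],
    }
    try:
        ctypes.CDLL(name)  # raises OSError if the loader cannot find it
        report["loadable"] = True
    except OSError as exc:
        report["error"] = str(exc)
    return report


if __name__ == "__main__":
    print(json.dumps(probe_shared_library(), indent=2))
```

Saved under a hypothetical name such as `probe_libmpi.py`, it could be run once as `python probe_libmpi.py` and once as `srun -n 1 python probe_libmpi.py`; a difference between the two reports would point at the environment srun passes to the job rather than at mpi4py itself.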

@forsyth2
Contributor

forsyth2 commented Nov 3, 2025

Looking at the ilamb bash template, srun only appears here:

echo
echo ===== RUN ILAMB =====
echo

# Run diagnostics
# Not required TODO?
# TODO: find the mpi run format for different platforms

# include cfg file
cat > ilamb.cfg << EOF
{% include cfg %}
EOF

echo ${workdir}
echo {{ scriptDir }}

srun -N 1 ilamb-run --config ilamb.cfg --model_root $model_root  --regions global --model_year ${Y1} 2000

I also see that TODO about mpi run format... is that something that is required??
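For the TODO about the MPI run format, one possible direction (purely a sketch, not how zppy currently works) would be to detect which launcher is available instead of hard-coding `srun`. The function name `pick_launcher` and the candidate list are assumptions for illustration; which launcher each platform actually needs is not established in this thread.

```python
import shutil


def pick_launcher(candidates=("srun", "mpirun", "mpiexec"), which=shutil.which):
    """Return the first launcher found on PATH, or None if none is found.

    `pick_launcher` and the candidate order are hypothetical; `which`
    is injectable so the lookup can be exercised without a real cluster.
    """
    for name in candidates:
        if which(name) is not None:
            return name
    return None


if __name__ == "__main__":
    launcher = pick_launcher()
    print(launcher if launcher else "no MPI launcher found on PATH")
```

A template could then fall back to running `ilamb-run` directly when no launcher is found, though whether that is appropriate depends on the platform.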

@xylar
Collaborator Author

xylar commented Nov 3, 2025

This is not the right place to have this discussion. Let's make an issue or move to Slack.

@andrewdnolan andrewdnolan merged commit cdbe0d3 into E3SM-Project:main Nov 4, 2025
5 checks passed
@xylar xylar deleted the add-compy-gnu-shell-scripts branch November 7, 2025 07:52